ENH: pd.read_html argument to extract hrefs along with text from cells #45973

abmyii · 2022-02-13T17:59:35Z

closes [ENH] pandas.read_html argument to interpret hyperlinks as links (not merely text) #13141
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
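For readers unfamiliar with the feature, the idea of returning a cell's href next to its text can be sketched without pandas. The following is a minimal standard-library illustration, not the code in this PR: `CellLinkParser` is a hypothetical name, and keeping only the first link per cell is an assumption of the sketch.

```python
from html.parser import HTMLParser


class CellLinkParser(HTMLParser):
    """Hypothetical sketch: collect every table cell as a (text, href) tuple."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of (text, href) tuples
        self._row = None    # cells of the row currently being parsed
        self._text = None   # text fragments of the cell currently being parsed
        self._href = None   # first href seen inside the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._text, self._href = [], None
        elif tag == "a" and self._text is not None and self._href is None:
            # Assumption of this sketch: only the first link in a cell is kept.
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._text is not None:
            # Text before, inside and after the <a> tag is all captured.
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            if self._row is not None:
                self._row.append(("".join(self._text).strip(), self._href))
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


html = (
    '<table><tr>'
    '<td>see <a href="https://pandas.pydata.org">pandas</a> docs</td>'
    '<td>plain</td>'
    '</tr></table>'
)
parser = CellLinkParser()
parser.feed(html)
print(parser.rows)
# [[('see pandas docs', 'https://pandas.pydata.org'), ('plain', None)]]
```

Cells without a link carry `None` as the second element, so every cell has the same shape regardless of its content.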

phofl

Could you please fix the pre-commit checks?

abmyii · 2022-02-14T12:32:35Z

Could you please fix the pre-commit checks?

Ah I missed that error - sorry about that. I've fixed it now.

attack68 · 2022-02-16T18:38:07Z

This looks fine to me, albeit there needs to be some agreement on the name of the new kwarg. Not convinced by extract_hrefs, but I can't right now suggest anything better.

abmyii · 2022-02-16T21:28:32Z

This looks fine to me, albeit there needs to be some agreement on the name of the new kwarg. Not convinced by extract_hrefs, but I can't right now suggest anything better.

I felt it was more explicit than extract_links - but perhaps that is a better name?

jreback · 2022-02-26T01:07:06Z

pandas/io/html.py

@@ -585,6 +624,10 @@ def _parse_tables(self, doc, match, attrs):
            raise ValueError(f"No tables found matching pattern {repr(match.pattern)}")
        return result

+    def _href_getter(self, obj):


can you type the args and returns of all of the added code

I've typed the returns, but won't lxml/bs4 be required to type the args?

@attack68 What shall I do about this?

@jreback Sorry to bother you, but I haven't been able to come up with a solution for this. Could you please suggest how I should do it?

To elaborate a bit on my first comment - the requirements may not be installed, and in that case the typing using the custom types defined in the libraries would fail (as far as I understand), so that doesn't seem like a viable solution.

@mroeschke Would you be able to enlighten me regarding this request? I'm still at a loss as to how to approach it.

At the top of the file you can do:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from bs4/lxml import ...

Then type obj. The CI checks have all the optional dependencies installed so these checks should be available.
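The suggestion above can be sketched as follows. This is only an illustration under assumptions: `href_of` is a hypothetical helper rather than the method added in this PR, and `bs4.element.Tag` is used as the example type. Because the import sits under `TYPE_CHECKING` and the annotation is a string, the module still imports when bs4 is not installed.

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Only static type checkers evaluate this branch; at runtime
    # TYPE_CHECKING is False, so bs4 does not need to be installed.
    from bs4.element import Tag


def href_of(cell: "Tag") -> Optional[str]:
    """Hypothetical helper: href of the first <a> inside a cell, or None."""
    # The quoted annotation is never resolved at runtime, so the missing
    # name Tag causes no NameError when bs4 is absent.
    link = cell.find("a", href=True)
    return None if link is None else link["href"]
```

The type checker sees the real `Tag` class, while users without the optional dependency can still import the module.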

attack68

This looks fine, except I think you need to test all the combinations ("header", "footer", etc.).

It would be good if you could put it all into one test and use @pytest.mark.parametrize.

Also, have you rendered the docs? I think the versionadded was in the wrong place, so it would be good to post a rendering to show the docs are working correctly.

abmyii · 2022-03-18T19:09:08Z

@attack68 Could you help with this request from jreback? I'm not sure how to go about testing it.

need a test that validates the extract_href arg is among the required values or raises.
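One way to read this request, as a hedged sketch: the argument is checked against a fixed set of allowed values and a `ValueError` is raised otherwise, with a test covering both outcomes. `check_extract_links` is a hypothetical stand-in for the validation inside `read_html`. The allowed values assume the option set under the later name `extract_links` (`None`, "header", "footer", "body", "all"), of which only "header" and "footer" are named in this thread. In the pandas test suite this would normally use `pytest.raises`; plain `try`/`except` keeps the sketch dependency-free.

```python
VALID_EXTRACT_LINKS = (None, "header", "footer", "body", "all")  # assumed option set


def check_extract_links(value):
    """Hypothetical validator: return value unchanged or raise ValueError."""
    if value not in VALID_EXTRACT_LINKS:
        raise ValueError(
            f"extract_links must be one of {VALID_EXTRACT_LINKS}, got {value!r}"
        )
    return value


def test_invalid_extract_links_raises():
    # An unsupported value must raise and the message must name the argument.
    try:
        check_extract_links("incorrect")
    except ValueError as err:
        assert "extract_links" in str(err)
    else:
        raise AssertionError("expected ValueError for an unsupported value")


def test_valid_extract_links_pass():
    # Every supported value is returned unchanged.
    for value in VALID_EXTRACT_LINKS:
        assert check_extract_links(value) == value


test_invalid_extract_links_raises()
test_valid_extract_links_pass()
```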

abmyii · 2022-06-18T18:20:19Z

lgtm, if rebased and passing tests

Rebased and tests are passing. I've kept the extract_links index structure custom (returns a flat tuple index rather than a MultiIndex) for now, but I'm open to reverting that.

I'm not sure what is causing the Docstring action to fail - the line is totally different to what is shown in the error.

attack68 · 2022-07-02T05:04:23Z

This is failing checks, e.g. pandas/io/html.py line 1024. You can review the logs for explanations.

abmyii · 2022-07-02T13:28:20Z

This is failing checks, e.g. pandas/io/html.py line 1024. You can review the logs for explanations.

https://github.com/abmyii/pandas/blob/read_html-extract-hrefs/pandas/io/html.py#L1024 is a blank line and 1025 is not a docstring?

Is https://github.com/abmyii/pandas/blob/read_html-extract-hrefs/pandas/io/html.py#L1141 the issue - and if so, should it be .. versionadded :: 1.5.0?

attack68

doc edits

mroeschke

Looks fairly good. Just one small comment and a merge conflict

mroeschke · 2022-08-16T18:41:58Z

Thanks for sticking with this @abmyii!

abmyii · 2022-08-16T18:45:53Z

Thanks for sticking with this @abmyii!

Awesome, thank you to all the reviewers, and you and @attack68 especially for helping me throughout this process!

@attack68

pandas-dev#45973)
* ENH: pd.read_html argument to extract hrefs along with text from cells
* Fix typing error
* Simplify tests
* Fix still incorrect typing
* Summarise whatsnew entry and move detailed explanation into user guide
* More flexible link extraction
* Suggested changes
* extract_hrefs -> extract_links
* Move versionadded to correct place and improve docstring for extract_links (@attack68)
* Test for invalid extract_links value
* Test all extract_link options
* Fix for MultiIndex headers (also fixes tests)
* Test that text surrounding <a> tag is still captured
* Test for multiple <a> tags in cell
* Fix all tests, with both MultiIndex -> Index and np.nan -> None conversions resolved
* Add back EOF newline to test_html.py
* Correct user guide example
* Update pandas/io/html.py
* Update pandas/io/html.py
* Update pandas/io/html.py
* Simplify MultiIndex -> Index conversion
* Move unnecessary fixtures into test body
* Simplify statement
* Fix code checks

Co-authored-by: JHM Darbyshire <[email protected]>

phofl reviewed Feb 14, 2022

ENH: pd.read_html argument to extract hrefs along with text from cells

d69ce74

abmyii force-pushed the read_html-extract-hrefs branch from 847d930 to d69ce74 Compare February 14, 2022 12:32

Fix typing error

ac86888

attack68 reviewed Feb 14, 2022

pandas/tests/io/test_html.py (outdated, resolved)

attack68 reviewed Feb 14, 2022

doc/source/whatsnew/v1.5.0.rst (outdated, resolved)

Simplify tests

b33dc9e

abmyii force-pushed the read_html-extract-hrefs branch 2 times, most recently from 50aab0d to 70cf3fa Compare February 15, 2022 17:37

Fix still incorrect typing

a13c5f0

abmyii force-pushed the read_html-extract-hrefs branch from 70cf3fa to a13c5f0 Compare February 15, 2022 21:59

abmyii requested a review from phofl February 15, 2022 22:55

Summarise whatsnew entry and move detailed explanation into user guide

76ebe35

abmyii requested a review from attack68 February 17, 2022 15:44

attack68 reviewed Feb 20, 2022

pandas/io/html.py (outdated, resolved)

More flexible link extraction

cd352e7

abmyii requested a review from attack68 February 23, 2022 19:17

jreback added the Enhancement and IO HTML (read_html, to_html, Styler.apply, Styler.applymap) labels Feb 26, 2022

jreback requested changes Feb 26, 2022

abmyii added 2 commits February 26, 2022 19:08

Suggested changes

1de1324

extract_hrefs -> extract_links

1190ea7

abmyii requested a review from jreback February 28, 2022 17:39

attack68 reviewed Mar 18, 2022

pandas/io/html.py (outdated, resolved)

attack68 reviewed Mar 18, 2022

pandas/io/html.py (outdated, resolved)

attack68 requested changes Mar 18, 2022

abmyii force-pushed the read_html-extract-hrefs branch from f418975 to 490005a Compare June 18, 2022 10:09

Correct user guide example

a5ff5c1

abmyii requested a review from attack68 June 30, 2022 12:04

Merge branch 'main' into read_html-extract-hrefs

85a183d

attack68 reviewed Jul 29, 2022

pandas/io/html.py (outdated, resolved)

pandas/io/html.py (outdated, resolved)

pandas/io/html.py (outdated, resolved)

attack68 added 3 commits July 29, 2022 23:34

Update pandas/io/html.py

58fdb0c

Update pandas/io/html.py

c34d8ff

Update pandas/io/html.py

7389b84

mroeschke reviewed Jul 29, 2022

pandas/tests/io/test_html.py (outdated, resolved)

mroeschke reviewed Jul 29, 2022

pandas/io/html.py (outdated, resolved)

Simplify MultiIndex -> Index conversion

ba7caab

abmyii requested a review from mroeschke July 30, 2022 11:06

Move unnecessary fixtures into test body

4c7f532

abmyii force-pushed the read_html-extract-hrefs branch from c6b7ea1 to 4c7f532 Compare August 1, 2022 18:14

mroeschke reviewed Aug 15, 2022

pandas/io/html.py (outdated, resolved)

mroeschke reviewed Aug 15, 2022

abmyii and others added 2 commits August 16, 2022 01:51

Simplify statement

98a46e2

Merge branch 'main' into read_html-extract-hrefs

fd41935

abmyii requested a review from mroeschke August 16, 2022 01:03

Fix code checks

614c636

mroeschke added this to the 1.5 milestone Aug 16, 2022

mroeschke approved these changes Aug 16, 2022

mroeschke merged commit 9f81aa6 into pandas-dev:main Aug 16, 2022

Review thread comment dates: jreback Feb 26, 2022; abmyii Feb 26, 2022 (edited); abmyii Mar 20, 2022; abmyii Jun 14, 2022; abmyii Jul 30, 2022; mroeschke Aug 1, 2022
